Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

170 ◾ Bioinformatics

For indexing the human genome, STAR may require more than 32GB of RAM and it may

take a long time depending on the computer RAM and the number of processors used.

Once indexing has been created, you can align the reads in FASTQ files to the reference

genome sequence. Since the practice sequence data is paired end, there are two FASTQ files

(r1 and r2) for each sample. STAR requires these two files as inputs in addition to the refer-

ence genome index we created above. Instead of running STAR for each sample, we can use

bash script as follows to align the reads of the sample files in “fastq” directory:

mkdir starout

mkdir bam

cd fastq

for i in $(ls *.gz | rev | cut -c 13- | rev | uniq);

STAR --runThreadN 4 \

--runMode alignReads \

--outSAMtype BAM SortedByCoordinate \

--readFilesCommand zcat \

--genomeDir ../indexes \

--outFileNamePrefix ../starout/${i} \

--readFilesIn ${i}_r1.fastq.gz ${i}_r2.fastq.gz

done

for f in $(ls *.bam | rev | cut -c 30-|rev);

mv ${f}Aligned.sortedByCoord.out.bam ../bam/${f}.bam

done

cd ..

In the above, first we created a directory “starout” for STAR output and “bam” for the BAM

files. The Linux bash command “$(ls *.gz | rev | cut -c 13- | rev | uniq)” lists the names of

the FASTQ files found in the directory and cut the common name (ID) for the files of each

sample. The “for loop” will provide that common name as a variable “${i}” to assemble the

FASTQ file names and the output file name prefix to the STAR command each time until

all FASTQ files are processed. There will be output files including alignment log files and a

BAM file for each sample (6 BAM files for all example data). The second “for loop” renamed

the long BAM file names and moved the files to the “bam” subdirectory.

The bash scripting comes in handy whenever there is a task that needs to be repeated

multiple times.

The STAR output log files provide important statistics about the read alignment. You

can refer to STAR manual to read about the output log files.

Before moving to the next step, you need to index the BAM files using “samtools index”

command as follows:

#index bam

cd bam

for i in $(ls *.bam);